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Summary 

A new parallel numerical scheme for solving incompressible 
steady-state flows is presented. The algorithm uses a finite- 
difference approach to solving the Navier-Stokes equations. 
The algorithms are scalable and expandable. They may be 
used with only two processors or with as many processors as 
are available. The code is general and expandable. Any size 
grid may be used. Four processors of the NASA Lewis 
Research Center Hypercluster were used to solve for steady- 
state flow in a driven square cavity. The Hypercluster was 
configured in a distributed-memory, hypercube-like architec- 
ture. By using a 50-by-50 finite-difference solution grid, an 
efficiency of 74 percent (a speedup of 2.96) was obtained. 

Introduction 

Over the years, as thrust-to-weight ratio has increased, 
propulsion systems have evolved into very sophisticated 
designs, requiring intricate interaction among an enormous 
number of complicated components. This complexity is the 
result of trying to harness as much power as possible from a 
propulsion system of given size and weight. 

As thrust-to-weight ratio continues to increase in the future, 
the propulsion system designer will have to rely heavily on 
high-speed computer simulations. Today, the simulation of a 
modem propulsion system on state-of-the-art serial computers 
may consume hundreds of hours of execution time to achieve 
the level of detail required. Yet, to be of help to the propulsion 
system designer, simulation results must be available in a 
reasonable amount of time. 


Because for many years simulation complexity has been 
increasing faster than the mainframe computer speeds capable 
of handling it, other techniques have been investigated to 
reduce the effective wallclock time to completion. Parallel 
processing has gained considerable popularity in the last few 
years, and much new commercial hardware is available today. 
But the concept of having several computers working in 
unison on the same problem has been around for a long time. 
For example, during the early 1980’s researchers at the NASA 
Lewis Research Center demonstrated the feasibility of parallel 
processing in engine simulation (refs. 1 through 5). The 
researchers executed a small helicopter engine simulation by 
using four processors in unison on a parallel processing system 
developed in-house at NASA Lewis (ref. 6). 

A characteristic of parallel processing that has become very 
apparent today is that hardware is leading the software 
available by a considerable amount. Powerful hardware is 
commercially available, but the software to harness that 
powerful hardware is not. This is particularly apparent in the 
area of fluid dynamics. Fluid dynamics problems are generally 
very computationally intensive, requiring solutions of systems 
of partial differential equations over huge one-, two-, and 
three-dimensional grids. One characteristic of these problems 
is that they often require solution of block-diagonal systems 
over finite-difference grids. 

This report discusses techniques for rapidly solving the 
steady-state, incompressible Navier-Stokes equations by using 
parallel processing and finite-differencing techniques. 
Algorithms are presented that are scalable, general, and 
portable. The algorithms can be used with as few as two 
processors or, for a large enough problem, with as many 
processors as are available. Any size problem, within the 
limits of the computer’s memory, can be solved. Care was 
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taken to minimize the effects of machine-specific items in 
order to ease porting of the code between multiple-input, 
multiple-data (MIMD) distributed-memory computers. 
Machine-specific items, such as COMMON and vector 
processing, are not used in the code. For demonstration 
purposes the results that were obtained from applying the 
algorithms to three classical fluid dynamics problems are 
presented, including solving for the steady-state flow within a 
driven square cavity. Finally, a technique for making the 
algorithms even faster is discussed. 

Algorithm Overview 


Let us also choose the following convention. Let x be defined 
as the values at the ith iteration level. Let x be the values at 
the (/- l)th iteration level. Hence u is the x-component of velo- 
city at the ith iteration level and u is the ^-component of 
velocity at the (/— l)th iteration level. 

If we define the operators 

C = u * — + V * JL (5) 

dx dy 

and 

L = C - _L V 2 (6) 

Re 


With the aggressive advances taking place in parallel 
processing hardware, new software and algorithms must be 
developed to harness this increase in computing capability. 
Because of the complicated nature of fluid dynamics problems 
many studies are attempting to develop new techniques and 
algorithms for taking advantage of parallelism in these compu- 
tationally intensive problems. One such technique is the finite- 
difference algorithm that was developed by A. Lin for solving 
the steady-state incompressible Navier-Stokes equations. This 
algorithm, which has the potential to provide a scalable, paral- 
lel solution, has been well documented over the past several 
years (refs. 7 through 10). For convenience, the key features 
of the algorithm are highlighted here. For a formal and 
detailed development of these concepts, consult the references. 

The steady-state incompressible Navier-Stokes equations can 
be written in the dimensionless form: 

Continuity equation: 

U x + V y = 0 ( 1 ) 

^-momentum equation: 

uu x + vu y + Px = J- V 2 « (2) 


and apply the following second-order quasi-linearization, given 
the general term f(x)g(x). 


fg a f*g + g*f~ f*8 * 


(7) 


Then the system of equations shown in equations (1) through 
(3) can be solved iteratively as 


where 


Ax = nx 


( 8 ) 


K 


A = 


dx 


L + 


K± 0 

dy 

► * d 

Uy ~dx 

i * 0 

+ Vy 5L 


(9) 


y-momentum equation: 




vv y + Pj 


= _L_V 2 v 
Re 


(3) 


where g x = dg/dx, p is the pressure function, u is the 
^-component of velocity, v is the y-component of velocity, and 
Re is the Reynolds number. This is a nonlinear system of 
elliptic partial differential equations that, in general, must be 
solved iteratively. 

Define x» the vector of variables, as 


X = 


v 

\p. 


(4) 


and 



0 

0 

C 


0> i 

0 


( 10 ) 


where K is a constant with the same units as the velocity. The 
continuity equation is multiplied by K in order to bring it to 
the same units as the momentum equations to ease eigenvalue 
analysis. 

The iteration matrix for equation (8) is 


2 


5 = A' 1 !! 


( 11 ) 

All the eigenvalues of J; for the continuous problem are 
bounded by 1 for all values of K * 0 when x is sufficiently 
close to x- 

Equation (8) can be solved by using the finite-difference 
numerical scheme shown in figure 1. Note that the primitive 
variables are staggered over the finite-difference grid. 
Boundary conditions for the internal flow field are also shown 
in the figure. Usually some artificial computational cells are 
added by appending two adjacent boundaries (i.e., i = 1 and 
j - 1). This is done to make the application of the algorithm 
easier. As shown in the figure, along the left column of cells 
appended, 


Let us define the grid and flow parameters 

a = — and (3 = _J L (14) 

k Re h 

where h is the finite-difference grid spacing in the i- or 
.x-direction and k is the finite-difference grid spacing in the 
j- or y-direction. 

A possible finite-difference approximation for the system 
given by equation (8) can be expressed as follows: The dis- 
crete continuity equation is 



< v i,y 



( 12 ) 


And the discrete momentum equations are 


where bj denotes the yth boundary value of the y-component of 
velocity. Along the bottom row of cells appended, 

(«,-,! + «/, 2) /2 = u b i ( 13 ) 


where b i denotes the zth boundary value of the ^-component of 
velocity. The pressure is zero within all the cells appended. 
The pressure variable does not appear on the boundary be- 
cause of the nature of the Navier-Stokes equations, as dis- 
cussed in references 9 and 10. 


v m,n 


= v b 


m 



Figure 1. — Variable locations and internal flow boundary conditions, 
where p is pressure function, u is x-component of velocity, v is y- 
component of velocity, u b is boundary value of x-component of 
velocity, and v b is boundary value of y-component of velocity. 


Ae kJ + Ipi+u - Pi) - p(“*+w - m «-iJ 

- Pa 2 («,j +1 - 2(3(1 + a 2 )^ 


and 


+ h 


U x lifij + \ U j )i/ij 


ij 


= 0 


(16) 


hc(vj * a(p iJ+l - pj - p(v f+u - Vj-jJ 

" P« 2 ( v /j + i - \ri) + 2 P( ! + « 2 ) v /j 
+ h 


v j li/u + V • * lif'J 


= 0 


(17) 


where C is the second-order, upwind, finite-difference scheme 
of the convection operator C given in equation (5). In evalu- 
ating C, the finite difference of the first derivative is taken as 
a forward difference if the variable is negative (((f)*)/ = (<j> /+1 
- (J ) i )/h or ((j) >( )y = ((j )j +l - (j >j)/k) and as a backward difference 
if the variable is positive ((c ) ) x ) t = (cj) / - (J> / _ 1 )//j or ((j )y)j = 
(4>y - §j_ x )lk). This is done to ensure the diagonal dominance 
of the system. 

Equations (1) through (3) depend on the change in pressure, 
not the pressure level itself. Thus the pressure level must be 
set. This can be done by enforcing a fixed pressure level in 
one of the cells. Although any cell may be chosen, a boundary 
or comer cell is generally chosen for convenience (see fig. 1). 
Equations (15) through (17) represent a second-order approxi- 
mation to the Navier-Stokes equations. They reflect the fact 
that the total mass flux through the domain D. is preserved. 
The scheme allows for the conservation of mass at each 
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computational cell at every iteration level. The conservation of 
mass is an important condition of steady-state incompressible 
flow and should be maintained. 

Parallel Environment and Computation 
Scheme 

As shall be seen later in this section when the parallel 
computation scheme is discussed, the algorithm highlighted 
previously lends itself quite nicely to solution in a parallel 
environment. The parallel environment chosen for this study 
was the Hypercluster, a versatile MIMD parallel processing 
testbed that has been developed at NASA Lewis for algorithm 
and architecture research (refs. 11 and 12). This section 
focuses on describing the Hypercluster architecture and the 
parallel computational scheme for solving equation (8). 

Hypercluster Architecture 

The Hypercluster parallel environment is an experimental 
machine that has been specifically designed to perform well on 
computational fluid dynamics (CFD) problems. The rationale 
behind the design of the Hypercluster testbed evolved from the 
complexity of investigating CFD algorithms on parallel and 
vector hardware. This rationale is discussed in reference 1 1. A 
four-node Hypercluster consists of four clusters of processors 
that are interconnected in a binary rt-cube configuration by 
intemode links. The cluster of processors at a node may con- 
sist of any number of scalar and/or vector processors. The 
architecture for one such four-node prototype is shown in 
figure 2. The links allow communication between nodes and 
consist of two control processors (CP’s) communicating 
through dual-ported memory. It is important to note that the 
CP’s can route information throughout the Hypercluster with- 
out interrupting computation of the current application. Thus, 
communication occurs in parallel with computation. 

The Hypercluster resembles other hypercube architectures in 
many respects but has the following important differences: 

(1) Each node of the Hypercluster can consist of more than 
one scalar and/or vector processor. 

(2) Processors within a node communicate through shared 
memory. 

(3) Independent CP’s perform the intemode communication 
without interrupting the processors within a node that are 
performing the application computations. These CP’s are 
programmable, allowing investigation of various message- 
passing protocols. 

(4) The processor technology within each node is not limited 
to any particular vendor. The use of a standard bus allows any 


processor board that is available for the bus to be incorporated. 
This feature also provides a rapid method for upgrading the 
system hardware. 

The user interacts with the hardware through a menu-driven 
Hypercluster operating system (HYCLOPS), which runs on the 
front-end processor. Hypercluster programs are written in For- 
tran 77, with library support for parallel and vector processing. 
Each processor in the Hypercluster runs a message-passing 
kernel that supports interprocessor communication. The kernel 
supports distributed-memory communication through simple 
“send” and “receive” library calls. Shared-memory communi- 
cation is supported through COMMON blocks. Details on the 
Hypercluster programming and operating environment can be 
found in reference 12. 

The Hypercluster provides a versatile environment that 
allows experimentation with many different subset archi- 
tectures and programming styles. Because of the interest in 
distributed-memory systems this mode of Hypercluster opera- 
tion was selected for the test cases presented herein. 

Parallel Computation Scheme 

An ideal application of the numerical scheme just described 
is to solve linearized systems iteratively. For any grid point 
(ij) the following linear relationship holds: 

P L/X, V + j + N + E/jX/+i j 

+ = ( RHS hj 


CL communication link 
M shared memory 
CP control processor 
SP scalar processor 
VP vector processor 



Front-end processor bus 

Figure 2. — Two-dimensional Hypercluster configuration. 
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where the coefficient matrices P, W, N, E, and S are derived 
from the operator A by using the discrete finite-difference 
equations (i.e., equation (9) together with equations (15) 
through (17)), x is the value of (w, v, p) r at the ki h iteration 
level, and RHS is a known term that depends on or the 

values at the (/:— 1 )th iteration level. 

Several parallel iterative methods for solving equation (18) 
have been discussed previously (refs. 13 and 14). The iterative 
approach that follows, which is called line relaxation, is well 
suited for use with distributed-memory M1MD parallel compu- 
ters because communication complexity is minimal. 

The iteration equation that is considered by the line- 
relaxation iterative algorithm is the following block-tridiagonal 
system: 

A r x r -! + B r x r + Cx r+1 = D r , 1 < r < n (19) 


new second equation by A 3 (5 2 ) 1 and subtract from the third 
equation, we eliminate the A 3 coefficient and get new values 
for the #3 and D 3 coefficents. Continue repeating this process 
(key j — 1) times until coefficient A keyi is eliminated. 

Now go to the last equation on the processor. Multiply this 
equation by C key 2 - 2 (Z? ke y 2 _i) 1 and subtract the result from the 
second-last equation. The C^ ey ^ 2 coefficient is eliminated and 
the A ke y 2 _ 2 coefficient is left alone. Coefficient B kty ^_ 2 is 
replaced with a new-valued coefficient, and a new coefficient 

C key 2 -2 = - C key 2 -2( fi key 2 -l)"' C key 2 -l is introduced. This COef- 

ficient multiplies Xkey. anc * this term will carry along in each 

step as we repeat the process to eliminate the C coefficients 
back to the key equation. Doing so, the key equation on 
processor 1 will be transformed into an equation of the form 


This system is obtained from equation (18) by deciding upon 
the implicit line, either i or y, moving the off-line terms to the 
right-hand side of the equation, and combining them with the 
RHS term to form the D term. The parallel algorithm for solv- 
ing equation (19) is based on a more general approach that is 
discussed in reference 15. 

The scheme for solving equation (19) is as follows: If q 
processors are available to solve the system, split the n equa- 
tions into q strips. Accomplish this by identifying q internal 
equations, or key equations. Denote by key,- the location of 
these equations so that 1 < keyj < key 2 < ... < key^ < n. For 
the remainder of this discussion, the following convention is 
used: Denote by the rth strip the set of equations from the 
(key r _j+ 1) equation through the (key r+1 ~ 1) equation. For com- 
pleteness, let key 0 = 0 and key^ +1 = (fl+1). 

Note that there is now some overlapping of the equations in 
the strips. For example, as shown in figure 3, the rth and the 
(r-l)th strips have the equations (key r _j+l) through (key r -l) 
in common. Likewise, the rth and the (r+l)th strips have the 
equations (key r +l) through (key r+1 ~l) in common. As ex- 
plained in the discussion that follows, the reason for this 
overlapping is to get a block-tridiagonal relationship between 
Xkey. j’ Xkey,’ and Xkey |+1 for each P rocessor 1 used in the 
parallel solution of the problem. 

The parallel solution of equation (19) takes place in the 
following fashion: Since we have q processors to solve equa- 
tion (19), we assign to the rth processor, 1 < r < q, the rth 
strip of equations, namely, the equations (key^+1) through 
(key r+1 -l). 

For the moment consider the equations on processor 1 . The 
system of equations is as shown in figure 4. Multiplying the 
first equation by A 2 and subtracting from the second equa- 
tion, we see that the A 2 coefficient in the second equation is 
eliminated and the C 2 coefficient is left alone. In addition, the 
#2 and D ] 2 coefficients are replaced with the modified coeffi- 
cients B 2 and D 2 . If we repeat the process and multiply this 


^keyjXkeyj + ^key 1 Xkey 2 ^key 1 ^ ^ 

Let us now turn our attention to processors 2 through (g-l). 
The system of equations on processor 2 is as shown in fig- 
ure 5. The same process is used to reduce this set of equations 
as was used on processor 1. The only difference is that in the 


key r _ 2 

r (key r _ 2 + 1 ) 


(r- 1) th strip 


(key^ -1) 
key r _i 

(key r _-| + 1 ) -i 


L (key r -1) 
key r 

r (key r + 1 ) 


r th strip 


(r + 1 ) th strip 


(key r+1 -1) J 
key r+ i 
(key r+1 +1) 


L (key r+2 - 1 ) 
key r+2 


Figure 3. — Calculation strip layout. 
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B 1 C, 

A2 B2 C 2 

A3 83 C3 

A4 84 C4 


^key-j ®key 1 ^key-j 


^key 2 -2 ®key 2 -2 ^key 2 -2 


^key 2 -1 ®key 2 -1 Ckey 2“1 


Figure 4. — Processor 1 equations. 



- ATI - 


r Di 1 


X2 


o 2 


*3 


d 3 


*4 

• 


d 4 

• 


• 

*keyi 

• 

• 


• 

D key 1 

• 

• 


• 

Vkey 2 -2 


• 

^key 2 -2 


Vkoyg-I 


D koy 2 -1 

- 

_ *key 2 _ 


- 


A key -j+1 ^key-j+1 ^key-j+l 

A key 1 +2 ®key 1 +2 ^key-j+2 


^ key i+3 ^keyi+3 ^keyi+3 

A key i+4 ®keyi+4 ^keyi+4 


Ak< 


;ey 2 


*key 2 


■'key 2 


^key 3 -2 8 key 3 -2 c key 3 -2 

^key 3 -1 e key 3 -1 ^key 3 -1 

Figure 5. — Processor 2 equations. 



*keyi 


- “ 


*keyi +1 


^keyi+1 


*keyi+2 


^keyi+2 


-^keyi+3 


D key.,+3 


*keyi +4 

• 

• 


^k©yi+4 

• 

• 


• 

Afkey 2 

• 

• 


• 

D Key 2 

• 

• 


• 

*k©y3~2 


• 

D key 3 -2 


*key3~1 


D key 3 -1 

- 

_ Afkey^ _ 


- 


first equation the A keyi+1 coefficient is not zero. Therefore, as 
we reduce the equations forward toward equation key 2 , a term 
gets carried along that multiplies Xkeyf Hence, when the 

operations are completed on processor 2, the key equation on 
that processor will be of the form 


'key^key 


^key ^key 


* ” 


(23) 


* * * * 
/ ^key 2 ^key 1 + ^key 2 ^key 2 + ^key 2 ^key 3 “ ^key 2 


( 21 ) 


In general, for the /th processor, 2 < i < (q- 1), the reduced 
key equation will be of the form 


* * * 

^keyXkey^j + ^key Xkeyj + ^key^key^j ~ ^key^ 


(22) 


The system of equations on the last processor is as shown in 
figure 6. Notice that when we begin to perform the backward 
reduction (i.e., multiply the last equation by C n _ { B n _ { and 
subtract the result from the second-last equation), the C n _ x 
coefficient is eliminated. Hence, the reduced key equation on 
this processor is of the form 


Equations (20) through (23) form a block-tridiagonal system, 
whose solution is the solution to equation (19) at the key 
points, key,-, 1 < i < q. This is a system of order q and, 
because it is usually a very small system ( q is usually a small 
number), it can be solved on a single processor by using stan- 
dard serial algorithms. However, if q were large, the system 
could be solved in parallel by applying the same techniques 
that we applied to equation (19). After the solution is obtained 
at the key points, these values are sent back to the appropriate 
processors. 

Now if the solution of equation (19) were required at only a 
few points, and those points were included as key points, we 
would be done. We would have the solution for equation (19) 
at the points desired, and we would not be required to perform 
a back substitution to obtain the remaining portion of the solu- 
tion of equation (19). However, to obtain the entire solution of 
equation (19), each processor, i, uses appropriate key point 
results to solve for Xkey-^+l thr ough Xkey r l ty way of back 
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^key q _i+1 fi key q _-,+1 C key q _ 1 +1 

A key q _i +2 ®key q _i +2 C key£?1+ 2 

^key q _i+3 e key q _i+3 
^key q _i+4 


c key q _i+3 

B key q _i +4 C key q _i +4 

^keyg ®keyg ^keyq 

&n - 1 ®n-1 Pn-1 

B n 

Figure 6. — Last processor equations. 
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*key q _-|+1 
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*key q _i+2 


D key„_ 1+ 2 


*k0y Q _-|+3 


^key^-tO 


*key q _-,+4 

• 


D key q -,+4 
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• 

*key q 

• 

• 


• 

°kay a 

• 

• 


• 

Xn- 1 


• 

On- 1 


Xn 


D n 


_ 


_ 


substitution. An exception is the last processor, processor q , 
which solves for Xfc ey i+ j through x rr 

The algorithm described here is both scalable and expand- 
able. The work load can be designed to be well balanced by 
choosing key points appropriately. For example, when a paral- 
lel homogeneous system is used, all the calculation strips 
generally will be chosen to be the same size. This will 
approach a balanced work load among the processors. The 
work load for the back-substitution phase is balanced among 
all the processors, except for the last one. During the back 
substitution this processor has to execute approximately double 
the number of substitutions as any of the other processors. 

Results and Discussion 


_ + _ = o 

dx dy 

with the boundary conditions 

y = 0: u = v = 0; y = °°: u = (26) 

where u and v are the jc-component and y-component of velo- 
city, respectively; and is the free-stream velocity as shown 
in figure 7. As discussed in reference 16 (pp. 125 and 126), by 
letting / be the normalized stream function, the governing 
equations become Blasius’ equation, which is 

ff" + 2 /"' = 0 ( 27 ) 


Speedup results that were obtained from applying the parallel 
algorithm to several different fluid dynamics problems are pre- 
sented in this section. Two special flow fields with known 
results (ref. 16) were selected for evaluating the parallel 
algorithm, including the quasi-linear approach that was used in 
the algorithm development. These first two applications 
demonstrate the fast convergence rate of the parallel algorithm. 
They also show that the speedup which was realized increases 
as relative communication time decreases. Following these 
demonstrations, the parallel algorithm is used to solve a driven 
square cavity. This well-known problem displays the intricate, 
coupled-flow relationships that are found in complicated fluid 
dynamics problems. 

The first problem we consider is examining the boundary 
layer along a flat plate at zero pressure gradient. The govern- 
ing equations are 


du + ^ du _ 1 cPu 

dx dy Re Qy2 


(24) 


with the boundary conditions 

/( 0) =/'(0) = 0, /'( oo) = 1 


(28) 


The location of the infinity point, a sufficiently large number, 
was arbitrarily chosen to be 8, as also was the solution grid of 
1000 points. In only six iterations the solution is converged to 
a residual of the order 10 (see table I). As shown in 


u x 

j I 


r _ 

M 

mr* 




a 




s» 

► 




I I 

Figure 7. — Boundary layer along flat plate at zero incidence. 
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Velocity 

components 


TABLE I.— RESIDUAL 
ERROR VERSUS NUMBER 
OF ITERATIONS FOR 
BOUNDARY LAYER 
PROBLEM 


Iteration 

Residual 

(maximum) 

error 

1 

0.14258* I0 1 

2 

.26890 

3 

.28632* 10' 1 

4 

.24873* 10' 3 

5 

. 16154* 10~ 7 

6 

. 11084*10'“ 



Number of processors 


Figure 8. — Calculation time versus number of proc- 
essors for boundary layer problem. 


z 



dv , «v , dv 
u + + vv 

dr r dz 


1 

Re 





(30) 


dw , dw 1 dp 

u + vv = - _ 

dr dz p dz 


+ 


1 

Re 


d“w 1 dw d 2 w > 

T F 


(31) 


figure 8, going from two processors to six processors did not 
give the theoretical increase of 3 in execution speed. However, 
this result was expected. Because of the simple nature of the 
problem the calculation time per processor was quite short, 
and hence the communication time between processors was not 
insignificant relative to the short calculation time per 
processor. 

The second problem examined was flow near a disk rotating 
in a fluid at rest. A diagram of the problem is shown in fig- 
ure 9. A layer of fluid is carried by the disk owing to the 
action of viscous forces. The centrifugal forces in the layer 
give rise to secondary flow that is directed radially outward. 
The governing equations, as given in reference 16 (p. 93), are 

u _ v’ " + ^ du 1 dp 

dr r dz p dr 


d-u 

aT 1 


d 
d r 


d-u 

aT 1 


(29) 


du 

dr 



(32) 


where p is the density of the fluid, r is the radius of the disk, 
P is the pressure, and //, v, and w are the radial, circumferen- 
tial, and axial velocity components, respectively. The Sparrow- 
Gregg equations governing this problem (ref. 16. p. 95) are 


2F + H' = 0 

(33) 

F 2 + F'H = 0 

(34) 

G : - F" = 0 

(35) 

2 FG + HG‘ - G" = 0 

(36) 


with boundary conditions 


8 


Re 


F( 0) = H( 0) = F(oo) = G(°°) = 0, G{0) = 1 


(37) 


u x-component of velocity 

v y-component of velocity 


where F, G, and H are the normalized velocity components. 
The location of the infinity point was again arbitrarily chosen 
to be 8, and the solution grid was chosen to be 500 points. 

The algorithm converges extremely fast. In only 10 iterations 
the residual is of the order 10 as shown in table II, but this 
is still a relatively simple problem for checking the parallel 
algorithm. However, now there is a little more substance to the 
calculations per processor relative to communication than there 
was for the previous test case, as reflected by figure 10. This 
time, as shown in the figure, a relative calculation speedup of 
more than 2 was realized in going from two processors to six 
processors. 

The parallel iterative algorithm was then applied to a more 
complicated fluid dynamics problem, solving for steady-state 
flow within a driven square cavity. The cavity, which had 
three rigid wall surfaces as shown in figure 1 1, was filled with 

TABLE II.— RESIDUAL 
ERROR VERSUS NUMBER 
OF ITERATIONS FOR 
DISK ROTATING IN 
FLUID AT REST 


Iteration 

Residual 

(maximum) 

error 

1 

0.15019M0 2 

2 

.52285* 10 1 

3 

.23837*10' 

4 

.10327*10* 

5 

.39308 

6 

.94527* 10'* 

7 

.47686* 10‘ 2 

8 

. 14872* 10~ 4 

9 

.78692* 10~ 9 

10 

.60396* 10' 13 



Number of processors 

Figure 1 0. — Calculation time versus number of proc- 
essors for disk rotating in fluid at rest. 



Figure 11 . — Driven square cavity showing boundary 
conditions along with primary and secondary 
vortices. 


fluid. The fourth surface moved at constant velocity within its 
own plane, causing the fluid within the cavity to swirl. The 
defining equations are equations (1) through (3) with the 
boundary conditions that the velocity components u and v are 
zero everywhere except at the moving surface. At that surface 
the x-component of velocity, «, is 1 and the y-component of 
velocity, v, is 0. The swirling action can cause vortices to 
appear, as shown in the figure. The approach for solving this 
problem was similar to some other primitive variable 
approaches that have been devised for the two-dimensional 
driven cavity flow (ref. 17). As discussed earlier and shown in 
figure 1, the pressure, the velocity components, and the equa- 
tions were staggered over the finite-difference grid. The con- 
vection terms were discretized in a special way to preserve the 
second-order stability scheme. Because of the complicated 
nature of the flows and the strong coupling within the cavity, 
convergence was slow. 

The driven square cavity problem was solved over a 
50-by-50 finite-difference grid by using four processors in 
parallel. Compared with using only a single processor to solve 
the problem serially, a speedup of 2.96 was attained. This 
speedup corresponds to an efficiency of 74 percent for the 
four-processor simulation. Graphical outputs for the stream 
function, the horizontal component of velocity, the vertical 
component of velocity, and the pressure function were 
obtained. Typical graphical outputs are displayed in figure 12. 
The Reynolds number was 100 for the results shown in the 
figure. 

Looking to improve the efficiency being obtained, we deter- 
mined that an increase in speedup would occur if the inverses 
of the B diagonal coefficients in the forward and backward 
reductions were performed in parallel to the rest of the scheme 
rather than being done serially. The elements of the B coeffi- 
cient are available substantially before the inverse is required. 
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Figure 1 2 —Driven square cavity. Reynolds number, 1 00. 


Hence, if the inverse were performed in parallel to the main 
computations, it could be calculated while other portions of the 
main calculation path were taking place. The inverse would 
then be ready by the time it was needed in the main calcula- 
tion path, provided that it could be completed fast enough. 
And the calculation would be essentally “free” (no overhead), 
except for the data transfer time. The price for this increased 
performance is having processors dedicated to the matrix 
inversion. 


Concluding Remarks 

A new parallel numerical scheme for solving incompressible 
steady-state flow has been developed. The algorithm uses a 
finite-difference approach to solving the Navier-Stokes 
equations. The code is general and portable. Care was taken to 


minimize the effects of machine-specific items in order to ease 
porting of the code between MIMD distributed-memory 
computers. Machine-specific items, such as COMMON and 
vector processing, are not used in the code. 

The algorithms are scalable and expandable. They may be 
used with only two processors or with as many processors as 
are available. The more processors that are available, the more 
key points may be chosen in the block-tridiagonal system 
being solved, reducing the work per processor. Work can be 
balanced among the processors by the choice of key points. 
Key points need not be equally spaced, although for a homo- 
geneous system equal spacing is a desirable choice for 
balancing the work load. 

Speedup results obtained ffom applying the parallel algorithm 
to several different fluid dynamics problems were presented 
herein. Examining the boundary layer along a flat plate at zero 
pressure gradient demonstrated the fast convergence rate of the 
parallel algorithm. In only six iterations the solution converged 
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to a residual of the order 10~ 12 . Examining the flow in the 
neighborhood of a disk rotating in a fluid at rest demonstrated 
that increased speedup is obtained as relative communication 
time is decreased among the processors used. Following these 
demonstrations, the parallel algorithm was used to solve for 
steady-state flow in a driven square cavity. This well-known 
problem displays the intricate, coupled-flow relationships that 
are found in complicated fluid dynamics problems. By using 
a 50-by-50 finite-difference solution grid and four processors 
of the NASA Lewis Research Center Hypercluster, an effi- 
ciency of 74 percent (a speedup of 2.96) was obtained. 
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